A New Approach Towards Vertical Search Engines - Intelligent Focused Crawling and Multilingual Semantic Techniques
Abstract
Search engines typically consist of a crawler, which traverses the web and retrieves documents, and a search frontend, which provides the user interface to the acquired information. Focused crawlers refine the crawler by intelligently directing it to predefined topic areas. The evolution of search engines is currently accelerated by additional search capabilities, such as searching metadata as well as the content text. Semantic web standards supply methods for augmenting webpages with metadata. Where necessary, machine learning techniques are used to gather more metadata from unstructured webpages. This paper analyzes the effectiveness of techniques for vertical search engines with respect to focused crawling and metadata integration, using the field of “educational research” as an example. A search engine for these purposes, implemented within the EERQI project, is described and tested. The enhancement of focused crawling through link analysis and anchor text classification is implemented and verified. A new heuristic score calculation formula has been developed for focusing the crawler. Full texts and metadata from various multilingual sources are collected and combined into a common format.
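The abstract does not reproduce the heuristic score formula, so the following is only a minimal sketch of the general idea of focusing a crawler: links are scored from their anchor text and the relevance of the page they were found on, and the highest-scoring link is crawled first. The topic terms, the weights `w_anchor` and `w_parent`, and the names `term_overlap`, `link_score`, and `Frontier` are all assumptions made for illustration and are not taken from the paper.

```python
import heapq
import itertools

# Hypothetical topic vocabulary; a real system would use a trained classifier.
TOPIC_TERMS = {"education", "educational", "research", "pedagogy", "learning"}


def term_overlap(text, terms=TOPIC_TERMS):
    """Return the fraction of tokens in `text` that are topic terms."""
    tokens = [t.strip(".,;:!?()\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in terms for t in tokens) / len(tokens)


def link_score(anchor_text, parent_relevance, w_anchor=0.6, w_parent=0.4):
    """Illustrative weighted sum; NOT the formula developed in the paper."""
    return w_anchor * term_overlap(anchor_text) + w_parent * parent_relevance


class Frontier:
    """Crawl frontier that always yields the highest-scoring unseen URL."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def push(self, url, score):
        if url in self._seen:  # ignore URLs that were already queued
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the maximum.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    frontier = Frontier()
    frontier.push("http://example.org/a", link_score("cheap flights", 0.5))
    frontier.push("http://example.org/b",
                  link_score("educational research journal", 0.5))
    while frontier:
        print(frontier.pop())
```

A priority queue is used here because a focused crawler has to decide repeatedly which of many discovered links to follow next; the scoring function can be swapped out without changing the frontier.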
Similar resources
Focused Crawling Using Latent Semantic Indexing - An Application for Vertical Search Engines
Vertical search engines and web portals are gaining ground over the general-purpose engines due to their limited size and their high precision for the domain they cover. The number of vertical portals has rapidly increased over the last years, making the importance of a topic-driven (focused) crawler evident. In this paper, we develop a latent semantic indexing classifier that combines link ana...
Combining Text and Link Analysis for Focused Crawling
The number of vertical search engines and portals has rapidly increased over the last years, making the importance of a topic-driven (focused) crawler evident. In this paper, we develop a latent semantic indexing classifier that combines link analysis with text content in order to retrieve and index domain specific web documents. We compare its efficiency with other well-known web information r...
Self Ranking and Evaluation Approach for Focused Crawler Based on Multi-Agent System
The need for better ways of retrieving information and of dealing with the increasing complexity and volume of information for users is an important research theme. Retrieving information from the WWW via search engines may be regarded as the most significant one. Most of the recent efforts in this area suggest a better solution for general-purpose search engine limitations. That...
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...
A New Approach for Building a Scalable and Adaptive Vertical Search Engine
Search engines are the most important search tools for finding useful and recent information on the Web today. They rely on crawlers that continually crawl the Web for new pages. Meanwhile, focused crawlers have become an attractive area for research in recent years. They suggest a better solution for general-purpose search engine limitations and lead to a new generation of search engines calle...